Fast GPGPU Data Rearrangement Kernels using CUDA
Authors
Abstract
* Corresponding author: [email protected]. Graduate student at TUM; work carried out at GE Global Research towards a master's thesis at TUM.
Abstract: Many high-performance computing algorithms are bandwidth limited, hence the need for optimal data rearrangement kernels and for their easy integration into the rest of the application. In this work, we have built a CUDA library of fast kernels for a set of data rearrangement operations. In particular, we have built generic kernels for rearranging m-dimensional data into n dimensions, including Permute, Reorder, Interlace/Deinterlace, etc. We have also built kernels for generic Stencil computations on two-dimensional data, using templates and functors that allow application developers to rapidly build customized high-performance kernels. All of the kernels achieve or surpass the best-known performance in terms of bandwidth utilization.
Similar resources
Technical Report: GIT-CERCS-09-06 A Characterization and Analysis of GPGPU Kernels
General purpose application development for GPUs (GPGPU) has recently gained momentum as a cost-effective approach for accelerating data- and compute-intensive applications, pushed to the forefront by the introduction of C-based programming environments such as NVIDIA's CUDA [1], OpenCL [2], and Intel's Ct [3]. While significant effort has been focused on developing and evaluating applications an...
Data access optimized applications on the GPU using NVIDIA CUDA
This work is an attempt to address the problem of bandwidth-limited performance in data-intensive GPGPU applications. Performance limited by memory bandwidth is a common issue faced by general data-intensive HPC applications. In the case of the GPU, this problem is more pronounced owing to the unique architecture. This problem has been tackled by optimizing basic data rearrangement operations on the ...
Developing a High Performance GPGPU Compiler Using Cetus
In this paper we present our experience in developing an optimizing compiler for general purpose computation on graphics processing units (GPGPU) based on the Cetus compiler framework. The input to our compiler is a naïve GPU kernel procedure, which is functionally correct but without any consideration for performance optimization. Our compiler applies a set of optimization techniques to the na...
Adaptable and Efficient Variable Size Template Matching in CUDA
Introduction: Increasingly flexible GPUs and the advent of GPGPU (General Purpose GPU) languages, such as Nvidia's CUDA and the OpenCL standard, offer potential peak performance that far exceeds that of general purpose CPUs for a variety of problems. However, architectural and programming restrictions often prevent programmers from achieving peak performance. Even for problems that map well to c...
FATSEA – An Architectural Simulator for General Purpose Computing on GPUs
We present FATSEA, a functional and performance evaluation simulator written in C++ to handle kernels written in the CUDA programming language and aimed at GPGPU computing. FATSEA takes Parallel Thread eXecution (PTX) code as input, a device-independent code format generated by the Nvidia CUDA compiler, to validate results and estimate performance on Nvidia platforms. This paper shows ...
Journal title:
- CoRR
Volume abs/1011.3583 Issue
Pages -
Publication date 2009